CLDR-17884 Regenerate AddPopulationData, ConvertLanguageData, reduce standard out noise #3965

conradarcturus · 2024-08-15T20:07:32Z

I started this ticket because I was seeing a lot of noisy warnings and errors in the regular tests -- I ended up in a rabbit hole with the generated population data. This stack of commits updates the data inputs and fixes errors in the scripts so we can regenerate population data in a stable way now.

This PR completes the ticket.
Input data fixed and/or updated
Errors in input processing scripts fixed
Documentation added (eg. un-literacy.md)
Tests added/extended

Scripts run:

mvn package -DskipTests=true
Re-ran these scripts. They need to be run more regularly; some changes happen when they are:
- java -jar tools/cldr-code/target/cldr-code.jar AddPopulationData # Runs successfully now, some changes happen
- java -jar tools/cldr-code/target/cldr-code.jar ConvertLanguageData # Runs successfully now, some changes happen
These scripts do not run:
- java -jar tools/cldr-code/target/cldr-code.jar WikipediaOfficialLanguages
- java -jar tools/cldr-code/target/cldr-code.jar GenerateMaximalLocales
Running tests on GitHub (I still cannot run all of the tests locally)

Script output changes

A lot of the script Standard Out messages mentioned in the original ticket are now fixed and will not appear -- mostly from fixing input data sources and a few processing scripts. If there are legitimate errors in the future the warnings and errors will appropriately come back.

Suriname had two undistinguished sources of literacy data; the script now takes the max value of the two
- one was the overall number
- the other had filtered institutional data
Since the aggregate regions from world_bank_data.csv are now gone, there are no more warnings about aggregates without country codes, e.g. "Sub-Saharan Africa (all income levels)"
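
The Suriname handling above amounts to keeping the larger value when a country appears more than once. A minimal Python sketch of that idea, assuming rows arrive as (country, literacy percent) pairs; the function name and the sample numbers are hypothetical and are not taken from the CLDR tools or from UN data:

```python
def merge_literacy_rows(rows):
    """Collapse duplicate country rows by keeping the largest literacy value."""
    merged = {}
    for country, percent in rows:
        # If the country was already seen, keep whichever value is larger.
        merged[country] = max(percent, merged.get(country, percent))
    return merged


# Illustrative values only; they are not real UN figures.
sample = [("Suriname", 94.0), ("Suriname", 90.5), ("Italy", 99.0)]
print(merge_literacy_rows(sample))
```

Taking the max is a simple tie-breaker that needs no knowledge of which row is the overall figure; it would give the wrong answer if a filtered subset ever reported a higher rate than the overall number.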

Data changed:

country_language_population.tsv
- Fixed some areas where spaces were used that should be tabs -- this affected how scripts parsed Kara-Kalpak; the bug was introduced in CLDR-16953 add kaa_Latn and kaa_Cyrl locales #3657
- Added a Cantonese (Traditional) yue row, otherwise yue would disappear in the re-generated supplementalData.xml -- introduced in CLDR-17871 Create yue_Hant_CN stub locale #3945
factbook_gdp_ppp.csv & factbook_gdp_ppp.csv: CIA Factbook data updated and imported using the csv that's exported by the CIA's website -- see also the old CLDR update documentation.
- This will update all population counts in supplementalData.xml
- Some stale data was removed from the Factbook, so I added the missing entries to other_country_data.txt
other_country_data.txt: Added information that used to be in earlier versions of the CIA Factbook
world_bank_data.csv: Re-generated from the World Bank website. See also the old CLDR update documentation.
- A big difference is that I correctly read the instructions and did not import the country aggregates, e.g. "Sub-Saharan Africa (all income levels)"
alternate_country_names.txt: Removed skipped names that are no longer needed, since we no longer import CIA Factbook aggregates
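
This PR handles aggregates on the data side, by not importing them at all. Purely as an illustration of the same effect in code, the sketch below skips any row whose name does not resolve to a country code. The lookup table, the column names, and the function name are all assumptions made for the example; they do not describe the actual CSV layout or the CLDR import scripts:

```python
import csv
import io

# Hypothetical lookup from source names to region codes; the real tools
# resolve names through their own tables.
KNOWN_COUNTRIES = {"Suriname": "SR", "Italy": "IT"}


def country_rows(csv_text, name_column="Country Name"):
    """Yield (code, row) for rows naming a known country; skip everything else."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        code = KNOWN_COUNTRIES.get(row[name_column])
        if code is None:
            # Aggregates such as regional groupings have no country code.
            continue
        yield code, row


# Made-up values for illustration.
sample = (
    "Country Name,Value\n"
    "Suriname,1\n"
    '"Sub-Saharan Africa (all income levels)",2\n'
    "Italy,3\n"
)
print([code for code, _ in country_rows(sample)])
```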

Consequences for `supplementalData.xml`

Official Languages: The <language> territories tag should list the territories where the language is official, so some entries were updated. For instance, Mocheno was incorrectly considered an official language of Italy in CLDR-17430 add mhn Mocheno locale #3665
Population counts are updated, so some language population percentages may increase or decrease if the input data is an absolute value (since the denominator changed)
GDPs also changed
Literacy Rates: some have changed
- Note there was a wonderful bug where the UN literacy data was mis-parsed, so "96%" would be mis-read as "0.96%" -- I fixed that
References: The two Kara-Kalpak references are now grouped correctly, and the Chinese reference has been given more context too
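
To make the literacy mis-parse above concrete: a value written as "96%" should come out as 96 percent, not 0.96 percent. The sketch below is one way to parse such strings. It is not the fix applied in this PR, and the function name is hypothetical:

```python
def parse_percent(text):
    """Parse a string such as "96%" or "96" into a percentage (96.0), not a fraction."""
    cleaned = text.strip().rstrip("%").strip()
    value = float(cleaned)
    if not 0.0 <= value <= 100.0:
        raise ValueError(f"percentage out of range: {text!r}")
    return value


print(parse_percent("96%"))
```

The range check does not catch a value that was wrongly scaled down (0.96 is still a valid percentage), so a test with known inputs is the more reliable guard against this class of bug.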

ALLOW_MANY_COMMITS=true

java -jar tools/cldr-code/target/cldr-code.jar AddPopulationData

CLDR-17884 Check alternate country names without parentheses too CLDR-17884 Regex match add country note CLDR-17884 Remove world bank aggregates

macchiati · 2024-08-16T04:09:26Z

Looks great! I want to review it more tomorrow, but a couple of quick notes.

tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/country_language_population.tsv

If we make sure this always has the same number of columns (eg same number of tabs per line), then github will format it as a table, which is a convenience.

official

Nice catch with Mocheno.
We do have a 'stricter' sense of official than in some sources. (Some countries have 'honorific' official languages that are not official in practice; you can't do business with the government in those languages. We usually mark those as regional official.)

DavidLRowe · 2024-08-16T19:21:40Z

> If we make sure this always has the same number of columns (eg same number of tabs per line), then github will format it as a table, which is a convenience.

This would mean adjusting comment lines at the top with the warning and adding tabs to the ends of lines that don't currently have references.
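
A quick way to see which lines would need adjusting is to count tabs per line. This is a minimal sketch that takes the first line's tab count as the expected one; the function name is hypothetical and it is not part of the CLDR tools:

```python
def inconsistent_lines(tsv_text):
    """Return (line_number, tab_count) for lines whose tab count differs from the first line's."""
    lines = tsv_text.splitlines()
    if not lines:
        return []
    expected = lines[0].count("\t")
    return [
        (number, line.count("\t"))
        for number, line in enumerate(lines, start=1)
        if line.count("\t") != expected
    ]


# Line 2 is a comment with no tabs; line 4 is missing its last column.
sample = "a\tb\tc\n# comment without tabs\nd\te\tf\ng\th\n"
print(inconsistent_lines(sample))
```

If the file starts with comment lines, as this one does, the expected count would have to come from a data row or be passed in explicitly rather than taken from line 1.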

conradarcturus · 2024-08-17T00:12:23Z

> > If we make sure this always has the same number of columns (eg same number of tabs per line), then github will format it as a table, which is a convenience.
>
> This would mean adjusting comment lines at the top with the warning and adding tabs to the ends of lines that don't currently have references.

We can do that -- but there is already a plan to break up this file into smaller pieces to get rid of the redundant country information too -- I'll look into that over the next few weeks.

In follow-up pull requests I'll modernize the documentation into markdown files and add better tests. If you like the idea, I was also thinking of adding a checksum attribute for generated sections like territoryInfo.
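
As a rough illustration of the checksum idea, the sketch below hashes the serialized children of one section, so the result changes whenever the generated content changes. The function name, the choice of SHA-256, and the sample attribute values are assumptions made for the example, not a proposal for the actual format:

```python
import hashlib
import xml.etree.ElementTree as ET


def section_checksum(xml_text, section_tag):
    """Return a SHA-256 hex digest of the serialized children of one section."""
    root = ET.fromstring(xml_text)
    section = root if root.tag == section_tag else root.find(f".//{section_tag}")
    if section is None:
        raise ValueError(f"no <{section_tag}> element found")
    digest = hashlib.sha256()
    for child in section:
        # Hash children only, so a checksum stored on the section element
        # itself does not change the value it is meant to verify.
        digest.update(ET.tostring(child, encoding="unicode").encode("utf-8"))
    return digest.hexdigest()


# Illustrative data only; the attribute values are made up.
sample = (
    "<supplementalData><territoryInfo>"
    '<territory type="SR" gdp="1" literacyPercent="94" population="2"/>'
    "</territoryInfo></supplementalData>"
)
print(section_checksum(sample, "territoryInfo"))
```

A test could then recompute the digest and compare it with the stored attribute to detect hand edits to a generated section.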

xubom123

Help

conradarcturus added 4 commits August 14, 2024 14:06

CLDR-17884 Skip Suriname parsing

5a65a39

CLDR-17884 Fix issues re-generating xml from ConvertLanguageData

af1c81f

CLDR-17884 Update UN literacy import

936525b

java -jar tools/cldr-code/target/cldr-code.jar AddPopulationData

CLDR-17884 Update CIA Factbook

78146cb

CLDR-17884 Check alternate country names without parentheses too CLDR-17884 Regex match add country note CLDR-17884 Remove world bank aggregates

conradarcturus requested a review from srl295 August 15, 2024 20:07

github-actions bot assigned conradarcturus Aug 15, 2024

CLDR-17884 Style updates

46b2c58

conradarcturus requested a review from macchiati August 15, 2024 22:20

macchiati approved these changes Aug 16, 2024


conradarcturus merged commit b4e6abf into unicode-org:main Aug 16, 2024
12 checks passed

conradarcturus deleted the CLDR-17884-Fix-noisy-test-outputs branch August 16, 2024 16:44

xubom123 reviewed Aug 17, 2024
